New Feature Sets for Summarization by Sentence Extraction

Author

Hans van Halteren

Abstract

[...] they don’t necessarily provide a coherent account, be used as a basis for further processing. Ideally, the document would be thoroughly analyzed using linguistic and world knowledge to determine which sentences are appropriate for the extract. In practice, the necessary analysis is still too immature or too computationally intensive to yield sufficient results, so many existing systems extract sentences on the basis of a limited set of mundane features. The features most often used tend to be based on notions about document structure (for example, sentence position within the document, sentence length, repetition of words from the title or headings, selected cue words or phrases) or information content (for example, presence of high-frequency content words). The sidebar “Automatic Summarization” provides more details.

I wanted to investigate whether I could enhance the feature-based extraction strategy by including some features that weren’t primarily information retrieval (IR) oriented. The features I wanted to try were originally developed to recognize writing style and so were somewhat geared toward reducing the influence of subject-specific (IR) elements. The underlying idea is that, when you try to summarize an article through sentence extraction, you’re assuming that the most important information is concentrated in specific sentences. If this is indeed true, the article’s author, knowing which sentences are the most important, might have consciously or subconsciously written these sentences in a different style (measurable, recurring patterns in the usage of vocabulary, grammar, text structure, and so on) from the rest of the article. If that supposition holds, it would likely allow valuable additions to the sentence extraction toolbox.

People study automatic writing-style recognition primarily in the context of authorship attribution. In this task, you examine a certain text and try to determine which out of a given group of authors wrote it [4]. The decision is based on information about several style markers, or features, such as vocabulary size or the distribution of a small set of specific vocabulary items. You generally learn about the markers from inspecting other texts by the same authors. One main focus of authorship attribution research is to create an inventory of useful style markers [5]. Another important focus is to develop techniques for using these style markers to provide reliable-enough probability estimates for each potential author [6]. I developed a new technique for estimating probabilities as well as several sets of style markers that complement this technique. I wanted to find out if this technique and these features could also be used to locate extractable sentences in a document.
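
The conventional features listed in the abstract (sentence position, sentence length, title-word repetition, cue phrases, high-frequency content words) can be illustrated with a small sketch. This is not the technique the article develops; it is a minimal baseline under stated assumptions. The stopword list, the cue phrases, the equal feature weights, and the names `score_sentences` and `extract` are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Assumed, illustrative resources; none of these come from the article.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "that",
             "this", "it", "for", "on", "with", "as", "be", "by"}
CUE_PHRASES = ("in conclusion", "in summary", "we propose", "this article")


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_sentences(title, sentences, top_n_frequent=5):
    """Return one score per sentence as an unweighted sum of five features."""
    title_words = set(tokenize(title)) - STOPWORDS
    content = [w for s in sentences for w in tokenize(s) if w not in STOPWORDS]
    # The most frequent content words in the whole document.
    frequent = {w for w, _ in Counter(content).most_common(top_n_frequent)}

    scores = []
    for index, sentence in enumerate(sentences):
        words = tokenize(sentence)
        unique = set(words)
        position = 1.0 / (index + 1)           # earlier sentences score higher
        length = min(len(words), 25) / 25      # longer sentences, capped at 25 words
        title_overlap = len(unique & title_words) / max(len(title_words), 1)
        cue = 1.0 if any(p in sentence.lower() for p in CUE_PHRASES) else 0.0
        frequency = len(unique & frequent) / max(len(frequent), 1)
        scores.append(position + length + title_overlap + cue + frequency)
    return scores


def extract(title, sentences, k=2):
    """Pick the k highest-scoring sentences, returned in document order."""
    scores = score_sentences(title, sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]
```

Calling `extract(title, sentences, k=2)` on a list of sentence strings returns the two top-scoring sentences in their original order. A real system would tune or learn the feature weights rather than summing them equally.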

Similar resources

Results of CRL/NYU System at DUC-2003 and an Experiment on Division of Document Sets

We participated in three multi-document summarization tasks at the DUC-2003 formal run and evaluated the performance of our summarization system. Our summarization system based on sentence extraction also incorporated a module to estimate similarity between sentences for multi-document summarization. The similarity information was used for selecting the representative sentence among similar sen...

Comparison of Feature Usage at TSC-3 Summarization Tasks

We participated in two summarization tasks at the TSC-3. We have introduced categorization of feature values for our summarization system, which is based on sentence extraction technique. The categorized values were used as features for generating a decision tree. We compared our summarization system using the categorization of feature values with the one using linear combination of features in...

CRL/NYU Summarization System at DUC-2004

We participated in two multi-document summarization tasks (Task 2 and Task 5) at the DUC-2004 formal run and evaluated the performance of our summarization system. Our system based on sentence extraction also uses a module to estimate similarity between sentences. The similarity information was used for either selecting the representative sentence among similar sentences or gathering key senten...

A Survey For Multi-Document Summarization

Automatic Multi-Document summarization is still hard to realize. Under such circumstances, we believe, it is important to observe how humans are doing the same task, and look around for different strategies. We prepared 100 document sets similar to the ones used in the DUC multi-document summarization task. For each document set, several people prepared the following data and we conducted a sur...

CRL/NYU System at DUC-2004

Fuzzy Logic Based Method for Improving Text Summarization

Text summarization can be classified into two approaches: extraction and abstraction. This paper focuses on extraction approach. The goal of text summarization based on extraction approach is sentence selection. One of the methods to obtain the suitable sentences is to assign some numerical measure of a sentence for the summary called sentence weighting and then select the best ones. The first ...

Journal title:

IEEE Intelligent Systems

Volume 18

Publication date: 2003
